Abstract

We investigate the issue of model selection and the use of the nonconformity (strangeness) measure in
batch learning. Using the nonconformity measure we propose a new training algorithm that helps avoid
the need for Cross-Validation or Leave-One-Out model selection strategies. We provide a new generalisation
error bound using the notion of nonconformity to upper bound the loss of each test example and
show that our proposed approach is comparable to standard model selection methods, but with theoretical
guarantees of success and faster convergence. We demonstrate our novel model selection technique
using the Support Vector Machine.
1 Introduction

Model Selection is the task of choosing the best model for a particular data analysis task. It generally makes a
compromise between fit with the data and the complexity of the model. Furthermore, the chosen model is used in
subsequent analysis of test data. Currently the most popular techniques used by practitioners are Cross-Validation
(CV) and Leave-One-Out (LOO).

In this paper the model we concentrate on is the Support Vector Machine (SVM) (Boser et al., 1992). CV and LOO
are the modus operandi despite there being a number of alternative approaches proposed in the SVM literature. For
instance, Chapelle and Vapnik (1999) explore model selection using the span of the support vectors and re-scaling
of the feature space, whereas Momma and Bennett (2002), motivated by an application in drug design, propose
a fully-automated search methodology for model selection in SVMs for regression and classification. Gold and
Sollich (2003) give an in-depth review of a number of model selection alternatives for tuning the kernel parameters
and penalty coefficient C for SVMs, and although they find a model selection technique that performs well (at
high computational cost), the authors conclude that "the hunt is still on for a model selection criterion for SVM
classification which is both simple and gives consistent generalisation performance". More recent attempts at model
selection have been given by Hastie et al. (2004), who derive an algorithm that fits the entire path of SVM solutions for
every value of the cost parameter, while Li et al. (2005) propose to use the Vapnik-Chervonenkis (VC) bound; they put
forward an algorithm that employs a coarse-to-fine search strategy to obtain the best parameters in some predefined
ranges for a given problem. Furthermore, Ambroladze et al. (2006) propose a tighter PAC-Bayes bound to measure
the performance of SVM classifiers, which in turn can be used as a way of estimating the hyperparameters. Finally,
de Souza et al. (2006) have addressed model selection for multi-class SVMs using Particle Swarm Optimisation.

Recently, Ozogur-Akyuz et al. (In Press), following on work by Ozogur et al. (2008), show that selecting a model
whose hyperplane achieves the maximum separation from a test point obtains comparable error rates to those found
by selecting the SVM model through CV. In other words, while methods such as CV involve finding one SVM model
(together with its optimal parameters) that minimises the CV error, Ozogur-Akyuz et al. (In Press) keep all of the
models generated during the model selection stage and make predictions according to the model whose hyperplane
achieves the maximum separation from a test point. The main advantage of this approach is the computational saving
when compared to CV or LOO. However, their method is only applicable to large margin classifiers like SVMs.

We continue this line of research, but rather than using the distance of each test point from the hyperplane we explore
the idea of using the nonconformity measure (Vovk et al., 2005; Shafer & Vovk, 2008) of a test sample to a particular
label set. The nonconformity measure is a function that evaluates how 'strange' a prediction is according to the
different possibilities available. The notion of nonconformity has been proposed in the on-line learning framework
of conformal prediction (Shafer & Vovk, 2008), and is a way of scoring how different a new sample is from a bag¹
of old samples. The premise is that if the observed samples are well-sampled then we should have high confidence
in the correct prediction of new samples, given that they conform to the observations.

We take the nonconformity measure and apply it to the SVM algorithm during testing in order to gain a time advantage
over CV and to generalise the algorithm of Ozogur-Akyuz et al. (In Press). Hence we are not restricted to SVMs (or
indeed a measure of the margin for prediction) and can apply our method to a broader class of learning algorithms.
However, due to space constraints we only address the SVM technique and leave the application to other algorithms
(and other nonconformity measures not using the margin) as a future research study. Furthermore, we also derive a
novel learning theory bound that uses nonconformity as a measure of complexity. To our knowledge this is the first
attempt at using this type of measure to upper bound the loss of learning algorithms.

The paper is laid out as follows. In Section 2 we present the definitions used throughout the paper. Our main
algorithmic contributions are given in Section 3 where we present our nonconformity measure and its novel use in
prediction. Section 4 presents a novel generalisation error bound for our proposed algorithm. Finally, we present
experiments in Section 5 and conclude in Section 6.

2 Definitions 

The definitions are mainly taken from Shafer and Vovk (2008). 

Let $(x_i, y_i)$ be the $i$th input-output pair from an input space $X$ and output space $Y$. Let $z_i = (x_i, y_i)$
denote shorthand notation for each pair taken from the joint space $Z := X \times Y$.

We define a nonconformity measure as a real-valued function $A(S, z)$ that measures how different a sample $z$ is from
a set of observed samples $S = \{z_1, \ldots, z_m\}$. A nonconformity measure must be fixed a priori before any data has
been observed.

¹ A bag is a more general formalism of a mathematical set that allows repeated elements.

Conformal predictions work by making predictions according to the nonconformity measure outlined above. Given
a set $S = \{z_1, \ldots, z_m\}$ of training samples observed over successive time steps and a new sample $x$, a conformal
prediction algorithm will predict $y$ from a set containing the correct output with probability $1 - \epsilon$. For example,
if $\epsilon = 0.05$ then the prediction is within the so-called prediction region, a set containing the correct $y$ with 95%
probability. In this paper, we extend this framework to the batch learning model to make predictions using confidence
estimates, where for example we are 95% confident that our prediction is correct.

In the batch learning setting, rather than observing samples incrementally such as $x_1, y_1, \ldots, x_m, y_m$ we have a
training set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ containing all the samples for training, which are assumed to be drawn
i.i.d. from a fixed (but unknown) distribution $\mathcal{D}$. Given a function (hypothesis) space $\mathcal{H}$, the batch algorithm takes
training sample $S$ and outputs a hypothesis $f : X \to Y$ that maps samples to labels.

For the SVM notation let $\phi : X \to F$ map the training samples to a higher dimensional feature space $F$. The primal
SVM optimisation problem can be defined as follows:

$$\min_{w, b, \xi} \; \tfrac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{m} \xi_i$$
$$\text{subject to } y_i \left( \langle w, \phi(x_i) \rangle + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, m,$$

where $b$ is the bias term, $\xi \in \mathbb{R}^m$ is the vector of slack variables and $w$ is the primal weight vector in $F$, whose
2-norm minimisation corresponds to the maximisation of the margin between the set of positive and negative samples.
The notation $\langle \cdot, \cdot \rangle$ denotes the inner product. The dual optimisation problem gives us the flexibility of using
kernels to solve nonlinear problems (Scholkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004). The dual SVM
optimisation problem can be formulated as follows:

$$\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \kappa(x_i, x_j)$$
$$\text{subject to } \sum_{i=1}^{m} y_i \alpha_i = 0, \quad 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, m,$$

where $\kappa(\cdot, \cdot)$ is the kernel function and $\alpha \in \mathbb{R}^m$ are the dual (Lagrangian) variables. Throughout the paper we will
use the dual optimisation formulation of the SVM as we attempt to find the optimal regularisation parameter for the
SVM together with the optimal kernel parameters.

3 Nonconformity Measure 

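We now discuss the main focus of the paper. Let $S = S_{\mathrm{trn}} \cup S_{\mathrm{val}}$ be composed of a training set $S_{\mathrm{trn}}$ and a validation
set $S_{\mathrm{val}}$. We assume without loss of generality that

$$S = \{z_1^{\mathrm{trn}}, \ldots, z_m^{\mathrm{trn}}, z_1^{\mathrm{val}}, \ldots, z_n^{\mathrm{val}}\},$$

where $S_{\mathrm{trn}} = \{z_1^{\mathrm{trn}}, \ldots, z_m^{\mathrm{trn}}\}$ and $S_{\mathrm{val}} = \{z_1^{\mathrm{val}}, \ldots, z_n^{\mathrm{val}}\}$.

We start by defining our nonconformity measure $A(S_{\mathrm{val}}, z)$ for a function $f$ over the validation set $S_{\mathrm{val}}$ and
$j = 1, \ldots, n$ as

$$A(S_{\mathrm{val}}, z) = y f(x). \qquad (1)$$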

Note that this does not depend on the whole sample but just the test point. In itself it does not characterise how
different the point is. To do this we need the so-called p-value $p_A(S_{\mathrm{val}}, z)$ that computes the fraction of points in $S_{\mathrm{val}}$
with 'stranger' values:

$$p_A(S_{\mathrm{val}}, z) = \frac{\left| \{ 1 \leq j \leq n : A(S_{\mathrm{val}}, z_j^{\mathrm{val}}) \leq A(S_{\mathrm{val}}, z) \} \right|}{n},$$

which, in this case, measures the fraction of samples from the validation set whose functional margin is no larger than
the functional margin of the test point. The larger the margin obtained the more confidence we have in our prediction. The
nonconformity p-value of $z$ is between $1/n$ and 1. If it is small (tends to $1/n$) then sample $z$ is non-conforming and
if it is large (tends to 1) then it is conforming.
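
As a minimal illustration of this computation, the sketch below evaluates the p-value for both candidate labels of a single test point. The helper name `p_value`, the six validation margins and the decision value `f_x` are hypothetical and chosen only for illustration; the non-strict comparison is an assumption about how ties are counted.

```python
def p_value(val_margins, test_margin):
    """Fraction of validation margins y_j * f(x_j) that are <= the test margin y * f(x)."""
    n = len(val_margins)
    count = sum(1 for a in val_margins if a <= test_margin)
    return count / n


# Hypothetical functional margins of six validation samples
# (two misclassified, i.e. negative, and four correctly classified).
val_margins = [-1.2, -0.4, 0.3, 0.9, 1.5, 2.1]
f_x = 0.6  # hypothetical decision value f(x) of the test point

p_pos = p_value(val_margins, (+1) * f_x)  # candidate label y = +1
p_neg = p_value(val_margins, (-1) * f_x)  # candidate label y = -1
print(p_pos, p_neg)
```

With these made-up numbers the label +1 receives the larger p-value, so it is the more conforming of the two labels.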

In order to better illustrate this idea we show a simple pictorial example in Figure 1. We are given six validation
samples ordered around the threshold 0 (solid line) in terms of their correct/incorrect classification, i.e., a pair
$z^{\mathrm{val}} = (x^{\mathrm{val}}, y^{\mathrm{val}})$ will be correctly classified by $f$ iff $y^{\mathrm{val}} f(x^{\mathrm{val}}) > 0$. In our example two are incorrectly classified
(below the threshold) and four are correct. The picture on the left also includes $y f(x)$ for a test sample $x$ when its
label is considered to be positive, i.e., $y = +1$. In this case there remain 3 validation samples below its value of $y f(x)$,
giving us a nonconformity p-value of $p_A(S_{\mathrm{val}}, (x, +1)) = 3/6$ using the p-value defined above. A similar calculation
can be made for the picture on the right when we consider the label $y = -1$ for test point $x$, i.e., $(x, -1)$, which
gives a smaller p-value $p_A(S_{\mathrm{val}}, (x, -1))$. Therefore, with a higher probability, our test sample $x$ is conforming
to $+1$ (or equally non-conforming to $-1$) and should be predicted positive. We state the standard result for
nonconformity measures, but first define a nonconformity prediction scheme and its associated error.

[Figure 1 legend: validation samples; test sample.]

Figure 1: A simple illustrative example of non-conformal prediction using a validation set of six samples (2 are
misclassifications, 4 are correctly classified) on a single test sample with a positive functional value and its two label
possibilities of $+1$ (left) and $-1$ (right).

Definition 3.1. For a fixed nonconformity measure $A(S, z)$, its associated p-value, and $\epsilon > 0$, the confidence
predictor predicts the label set

$$\Gamma^{\epsilon}(S, x) = \{ y : p_A(S, (x, y)) > \epsilon \}.$$

The confidence predictor makes an error on sample $z = (x, y)$ if $y \notin \Gamma^{\epsilon}(S, x)$.

Proposition 3.2. For exchangeable distributions we have that

$$P^{n+1} \{ (S, z) : y \notin \Gamma^{\epsilon}(S, x) \} \leq \epsilon.$$



Proof. By exchangeability all permutations of a training set are equally likely. Denote by $\hat{S}$ the set $S$ extended
with the sample $z_{n+1}$ and by $\sigma$ a permutation of $n + 1$ objects. Let $\hat{S}_{\sigma}$ be the sequence of samples permuted by $\sigma$.
Consider the permutations for which the corresponding prediction of the final element of the sequence is not an error.
This implies that the value $A(\hat{S}_{\sigma}, z_{\sigma(n+1)})$ is in the upper $1 - \epsilon$ fraction of the values $A(\hat{S}_{\sigma}, z_{\sigma(i)})$, $i = 1, \ldots, n + 1$.
This will happen at least $1 - \epsilon$ of the time under the permutations, hence upper bounding the probability of error over
all possible sequences by $\epsilon$ as required. □



Following the theoretical motivation from Shawe-Taylor (1998) we proceed by computing all the SVM models and
applying them throughout the prediction stage. A fixed validation set, withheld from training, is used to calculate the
nonconformity measures. We start by constructing $K$ SVM models so that each decision function $f_k \in \mathcal{F}$ is in the
set $\mathcal{F}$ of decision functions with $k = 1, \ldots, K$. The different SVM models can be characterised by different
regularisation parameters $C$ (or $\nu$ in $\nu$-SVM) and the width parameter $\gamma$ in the Gaussian kernel case. For instance,
given 10 values $C = \{C_1, \ldots, C_{10}\}$ and 10 values $\gamma = \{\gamma_1, \ldots, \gamma_{10}\}$ for a Gaussian kernel we would have a total
of $|C| \times |\gamma| = 100$ SVM models, where $|\cdot|$ denotes the cardinality of a set.
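
The enumeration of models can be sketched as follows. The particular powers of two below are hypothetical values chosen only so that each grid has 10 entries; they are not claimed to be the values used in the experiments.

```python
from itertools import product

# Hypothetical grids with 10 values each (illustrative only).
C_values = [2.0 ** e for e in range(-5, 15, 2)]       # 2^-5, 2^-3, ..., 2^13
gamma_values = [2.0 ** e for e in range(-15, 5, 2)]   # 2^-15, 2^-13, ..., 2^3

# One (C, gamma) pair per SVM model; K is the total number of models.
models = list(product(C_values, gamma_values))
K = len(models)
print(K)
```

Each pair in `models` would correspond to one trained decision function $f_k$.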

We now describe our new model selection algorithm for the SVM using nonconformity. If the statement

$$\frac{\left| \{ j : y_j^{\mathrm{val}} f_k(x_j^{\mathrm{val}}) \leq y f_k(x) \} \right|}{n} > \epsilon$$

holds, then we include $y \in \Gamma^{\epsilon}$, where $\Gamma^{\epsilon}$ is the prediction region (the set of conforming labels). For classification,
the set $\Gamma^{\epsilon}$ can take the following values:

$$\emptyset, \quad \{-1\}, \quad \{+1\}, \quad \{-1, +1\}.$$

Clearly finding the prediction region $\Gamma^{\epsilon} = \{-1\}$ or $\Gamma^{\epsilon} = \{+1\}$ is useful in the classification scenario as it gives higher
confidence of the prediction being correct, while the sets $\Gamma^{\epsilon} = \emptyset$ and $\Gamma^{\epsilon} = \{-1, +1\}$ are useless as the first abstains
from making a prediction whilst the second is unbiased towards a label.
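
The dependence of the prediction region on $\epsilon$ can be sketched as below for a single model. The function names, the validation margins and the decision value are hypothetical; the example only shows how the three kinds of region (both labels, one label, empty) arise as $\epsilon$ grows.

```python
def p_value(val_margins, test_margin):
    """Fraction of validation margins that are <= the test margin."""
    return sum(1 for a in val_margins if a <= test_margin) / len(val_margins)


def prediction_region(val_margins, f_x, eps):
    """Labels y in {-1, +1} whose p-value exceeds eps (the conforming labels)."""
    return {y for y in (-1, +1) if p_value(val_margins, y * f_x) > eps}


# Hypothetical validation margins y_j * f_k(x_j) and test decision value f_k(x).
val_margins = [-1.2, -0.4, 0.3, 0.9, 1.5, 2.1]
f_x = 0.6

for eps in (0.1, 0.3, 0.6):
    print(eps, prediction_region(val_margins, f_x, eps))
```

For these made-up numbers a small $\epsilon$ keeps both labels, an intermediate one keeps only +1, and a large one leaves the region empty.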

Let $\epsilon_{\mathrm{crit}}$ be the critical $\epsilon$ that creates one label in the set for at least one of the $K$ models:

$$\epsilon_{\mathrm{crit}} = \min_{k \in \{1, \ldots, K\}} \; \min_{y \in \{-1, +1\}} \frac{\left| \{ j : y_j^{\mathrm{val}} f_k(x_j^{\mathrm{val}}) \leq y f_k(x) \} \right|}{n}. \qquad (2)$$

Furthermore, let $k_{\mathrm{crit}}$, $y_{\mathrm{crit}}$ be arguments that realise the minimum $\epsilon_{\mathrm{crit}}$, chosen randomly in the event of a tie. This
now gives the prediction of $x$ as $y = -y_{\mathrm{crit}}$. This is because $y_{\mathrm{crit}}$ is non-conforming (strange) and we wish to select
the opposite (conforming) label. In the experiments section we refer to the prediction strategy outlined above and
the model selection strategy given by equation (2) as the nonconformity model selection strategy. We set out the
pseudo-code for this procedure in Algorithm 1.

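Algorithm 1 Nonconformity model selection.

Input: Sample $S = \{(x_i, y_i)\}_{i=1}^{\ell}$, SVM parameters $C$ and $\gamma$ (for Gaussian kernel) where $K = |C| \times |\gamma|$.

Output: Predictions of test points $x_{\ell+1}, x_{\ell+2}, \ldots$

1: Take training data $S$ and randomly split it into training set $S_{\mathrm{trn}} = \{(x_1^{\mathrm{trn}}, y_1^{\mathrm{trn}}), \ldots, (x_m^{\mathrm{trn}}, y_m^{\mathrm{trn}})\}$ and validation set
   $S_{\mathrm{val}} = \{(x_1^{\mathrm{val}}, y_1^{\mathrm{val}}), \ldots, (x_n^{\mathrm{val}}, y_n^{\mathrm{val}})\}$ where $m + n = \ell$ {This split is only done once}.

2: Train $K$ SVM models on training data $S_{\mathrm{trn}}$ to find $f_1(\cdot), \ldots, f_K(\cdot)$.

3: Prediction Procedure: For a test point $x$ compute

   $$\epsilon_{\mathrm{crit}} = \min_{k \in \{1, \ldots, K\}} \; \min_{y \in \{-1, +1\}} \frac{\left| \{ j : y_j^{\mathrm{val}} f_k(x_j^{\mathrm{val}}) \leq y f_k(x) \} \right|}{n},$$

   realised by $k = k_{\mathrm{crit}}$ and $y = y_{\mathrm{crit}}$.

4: Predict label $-y_{\mathrm{crit}}$ for $x$.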
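
Steps 3 and 4 of Algorithm 1 can be sketched in a few lines, assuming the $K$ models have already been trained and their decision values are available. The function name `nonconformity_predict`, its argument layout and the numbers in the example are hypothetical; training itself (step 2) is deliberately left out.

```python
import random


def p_value(val_margins, test_margin):
    """Fraction of validation margins that are <= the test margin."""
    return sum(1 for a in val_margins if a <= test_margin) / len(val_margins)


def nonconformity_predict(val_margins_per_model, test_values, rng=random):
    """Prediction step for one test point.

    val_margins_per_model[k][j] holds y_j * f_k(x_j) on the validation set,
    test_values[k] holds f_k(x). Returns (predicted label, k_crit, eps_crit).
    """
    candidates = []
    for k, (margins, f_x) in enumerate(zip(val_margins_per_model, test_values)):
        for y in (-1, +1):
            candidates.append((p_value(margins, y * f_x), k, y))
    eps_crit = min(c[0] for c in candidates)
    # Ties for the minimum are broken at random.
    ties = [c for c in candidates if c[0] == eps_crit]
    _, k_crit, y_crit = rng.choice(ties)
    # y_crit is the most non-conforming label, so predict its opposite.
    return -y_crit, k_crit, eps_crit


# Hypothetical validation margins for two models and their test decision values.
val_margins_per_model = [
    [-1.2, -0.4, 0.3, 0.9, 1.5, 2.1],
    [-1.5, 0.2, 0.8, 1.1, 1.9, 2.5],
]
test_values = [0.35, 1.0]

label, k_crit, eps_crit = nonconformity_predict(val_margins_per_model, test_values)
print(label, k_crit, eps_crit)
```

In this made-up example the second model (index 1) finds the label −1 most strange, so the test point is predicted as +1.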



Before proceeding we would like to clarify some aspects of the Algorithm. The data is split into a training and
validation set once and therefore all $K$ models are computed on the training data. After this procedure we only
need to calculate the nonconformity p-value for all test points in order to make predictions. However, in
$b$-fold Cross-Validation we need to train, for each $C$ and $\gamma$ parameter, a further $b$ times. Hence CV will be at most
$b$ times more computationally expensive.
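
To make the comparison concrete, the following sketch counts SVM training runs under each strategy. The 10 x 10 grid and $b = 10$ are hypothetical, and the count ignores the fact that each CV fold trains on somewhat less data than the single training set, which may be one reason the factor $b$ is only an upper limit on the cost ratio.

```python
def training_runs(num_C, num_gamma, folds=None):
    """Number of SVM fits: K for the nonconformity strategy, folds * K for CV."""
    K = num_C * num_gamma
    return K if folds is None else folds * K


runs_nonconformity = training_runs(10, 10)        # one fit per (C, gamma) pair
runs_cv = training_runs(10, 10, folds=10)         # one fit per pair and per fold
print(runs_nonconformity, runs_cv, runs_cv / runs_nonconformity)
```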

4 Nonconformity Generalisation Error Bound 

The problem with Proposition 3.2 is that it requires the validation set to be generated afresh for each test point,
specifies just one value of $\epsilon$, and only applies to a single test function. In our application we would like to reuse the
validation set for all of our test data and use an empirically determined value of $\epsilon$. Furthermore, we would like to use
the computed errors for different functions in order to select one for classifying the test point.

We therefore need to have uniform convergence of empirical estimates to true values for all values of $\epsilon$ and all
$K$ functions. We first consider the question of uniform convergence for all values of $\epsilon$.

If we consider the cumulative distribution function $F(\gamma)$ defined by

$$F(\gamma) = P\{ (x, y) : y f(x) \leq \gamma \},$$

we need to bound the difference between empirical estimates of this function and its true value. This corresponds to
bounding the difference between true and empirical probabilities over the sets

$$\mathcal{A} = \{ (-\infty, a] : a \in \mathbb{R} \}.$$

Observe that we cannot shatter two points of the real line with this set system as the larger cannot be included in a
set without the smaller. It follows that this class of functions has Vapnik-Chervonenkis (VC) dimension 1. We can
therefore apply the following standard result, see for example Devroye et al. (1996).
Theorem 4.1. Let $X$ be a measurable space with a fixed but unknown probability distribution $P$. Let $\mathcal{A}$ be a set
system over $X$ with VC dimension $d$ and fix $\delta > 0$. With probability at least $1 - \delta$ over the generation of an i.i.d.
$m$-sample $S \subseteq X$, for all $A \in \mathcal{A}$,

$$\left| \frac{|S \cap A|}{m} - P(A) \right| \leq 5.66 \sqrt{\frac{d \ln\left(\frac{em}{d}\right) + \ln\frac{8}{\delta}}{m}}.$$
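
The constant 5.66 can be traced as follows. This is only a sketch, assuming the theorem is obtained in the usual way from the Vapnik-Chervonenkis inequality (as given in Devroye et al. (1996)) combined with Sauer's bound on the shatter coefficient, valid for $m \geq d$; the symbols $s(\mathcal{A}, m)$ for the shatter coefficient and $t$ for the deviation are our own notation.

```latex
\begin{align*}
P\left\{ \sup_{A \in \mathcal{A}} \left| \frac{|S \cap A|}{m} - P(A) \right| > t \right\}
  &\leq 8\, s(\mathcal{A}, m)\, e^{-m t^2 / 32},
  \qquad s(\mathcal{A}, m) \leq \left( \frac{em}{d} \right)^{d}, \\
8 \left( \frac{em}{d} \right)^{d} e^{-m t^2 / 32} = \delta
  &\iff t = \sqrt{\frac{32 \left( d \ln\frac{em}{d} + \ln\frac{8}{\delta} \right)}{m}}
   = 4\sqrt{2}\, \sqrt{\frac{d \ln\frac{em}{d} + \ln\frac{8}{\delta}}{m}}, \\
4\sqrt{2} &\approx 5.66 .
\end{align*}
```

Setting the right-hand side of the inequality equal to $\delta$ and solving for $t$ therefore reproduces the stated form of the bound.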



We now apply this result to the error estimations derived by our algorithm for the K possible choices of model. 

Proposition 4.2. Fix $\delta > 0$. Suppose that the validation set $S_{\mathrm{val}}$ of size $n$ in Algorithm 1 has been chosen i.i.d.
according to a fixed but unknown distribution that is also used to generate the test data. Then with probability at
least $1 - \delta$ over the generation of $S_{\mathrm{val}}$, if for a test point $x$ the algorithm returns a classification $\hat{y} = -y_{\mathrm{crit}}$, using
function $f_{k_{\mathrm{crit}}}$, $1 \leq k_{\mathrm{crit}} \leq K$, realising a minimum value of $\epsilon_{\mathrm{crit}}$, then the probability of misclassification satisfies

$$P\{ (x, y) : y \neq \hat{y} \} \leq \epsilon_{\mathrm{crit}} + 5.66 \sqrt{\frac{\ln(en) + \ln\frac{8K}{\delta}}{n}}.$$



Proof. We apply Theorem 4.1 once for each function $f_k$, $1 \leq k \leq K$, with $\delta$ replaced by $\delta/K$. This implies that
with probability $1 - \delta$ the bound holds for all of the functions $f_k$, including the chosen $f_{k_{\mathrm{crit}}}$. For this function the
empirical probability of the label $y_{\mathrm{crit}}$ being observed is $\epsilon_{\mathrm{crit}}$, hence the true probability of this opposite label is
bounded as required. □
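
The bound is easy to evaluate numerically. The sketch below assumes the slack term reads $5.66\sqrt{(\ln(en) + \ln(8K/\delta))/n}$, i.e. Theorem 4.1 with $d = 1$, $m = n$ and $\delta$ replaced by $\delta/K$; the function name and the values of $n$, $K$ and $\delta$ are hypothetical.

```python
import math


def nonconformity_bound(eps_crit, n, K, delta):
    """eps_crit plus the VC slack term for a validation set of size n and K models."""
    slack = 5.66 * math.sqrt((math.log(math.e * n) + math.log(8 * K / delta)) / n)
    return eps_crit + slack


# Hypothetical settings: 100 models, 95% confidence, two validation set sizes.
small = nonconformity_bound(0.0, n=50, K=100, delta=0.05)
large = nonconformity_bound(0.0, n=5000, K=100, delta=0.05)
print(small, large)
```

For a validation set of 50 samples the slack term alone exceeds 1 under these assumptions, so the bound is informative mainly through how it ranks test points rather than through its absolute value; it shrinks as $n$ grows.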

Remark 4.3. The bound in Proposition 4.2 is applied using each test sample, which in turn gives a different bound
value for each test point (e.g., see Shawe-Taylor (1998)). Therefore, we are unable to compare this bound with
existing training set CV bounds (Kearns & Ron, 1999; Zhang, 2001) as they are traditional a priori bounds computed
over the training data, and which give a uniform value for all test points (i.e., training set bounds (Langford, 2005)).



5 Experiments 

In the following experiments we compare SVM model selection using traditional CV to our proposed nonconformity 
strategy as well as to the model selection using the maximum margin (Ozogur-Akyuz et al., In Press) from a test 
sample. 

We make use of the Votes, Glass, Haberman, Bupa, Credit, Pima, BreastW and Ionosphere data sets acquired from the
UCI machine learning repository.² The data sets were pre-processed such that samples containing unknown values
and contradictory labels were removed. Table 1 lists the various attributes of each data set. The LibSVM package
2.85 (Chang & Lin, 2001) and the Gaussian kernel were used throughout the experiments. Model selection was
carried out for the values listed in Table 2.

Table 1: Description of data sets: Each row contains the name of the data set, the number of samples and features
(i.e. attributes) as well as the total number of positive and negative samples.

Data set      # Samples   # Features   # Positive Samples   # Negative Samples
Votes         52          16           18                   34
Glass         163         9            87                   76
Haberman      294         3            219                  75
Bupa          345         6            145                  200
Credit        653         15           296                  357
Pima          768         8            269                  499
BreastW       683         9            239                  444
Ionosphere    351         34           225                  126

Table 2: Model selection values for $\gamma$ and $C$ for both cross-validation and nonconformity measure.

$\gamma = \{2^{-15}, 2^{-13}, 2^{-11}, 2^{-9}, 2^{-7}, 2^{-5}, 2^{-3}, 2^{-1}, 2^{1}, 2^{3}\}$
$C = \{2^{-5}, 2^{-3}, 2^{-1}, 2^{1}, 2^{3}, 2^{5}, 2^{7}, 2^{9}, 2^{11}, 2^{13}, 2^{15}\}$

In the experiments we apply a 10-fold CV routine where the data is split into 10 separate folds, with 1 used for testing
and the remaining 9 split into a training and validation set. We then use the following procedures for each of the two
model selection strategies:

• Nonconformity: split the samples into a training set and a validation set whose size is the smaller of a fixed
fraction of $\ell$ and 50, where $\ell$ is the number of samples.³ Using the training data we learn all models using $C$
and $\gamma$ from Table 2.

• Cross-Validation: carry out a 10-fold CV only on the training data used in the Nonconformity procedure to find
the optimal $C$ and $\gamma$ from Table 2.

² http://archive.ics.uci.edu/ml/

The validation set is excluded from training in both methods, but used for prediction in the nonconformity method. 
Hence, the samples used for training and testing were identical for both CV and the nonconformity model selection 
strategy. We feel that this was a fair comparison as both methods were given the same data samples from which to 
train the models. 

Table 3 presents the results where we report the average error and standard deviation for Cross-Validation and the
nonconformity strategy. We are immediately able to observe that carrying out model selection using the nonconformity
measure is, on average, a factor of 7.3 times faster than using CV. The results show that (excluding the Haberman
data set) nonconformity seems to perform similarly to CV in terms of generalisation error. However, lower values
for the standard deviation on Votes, Glass, Bupa and Credit suggest that on these data sets nonconformity gives more
consistent results than CV. Furthermore, when excluding the Haberman data set, the overall error for the model
selection using nonconformity is 0.1730 ± 0.0659 and for CV is 0.1686 ± 0.0886, constituting a difference of only 0.0044
(less than half a percent) in favour of CV and a difference in standard deviation of 0.0227 in favour of the nonconformity
approach. We hypothesise that the inferior results for Haberman are due to the very small number of features (only 3).

We also compare the nonconformity strategy to the SVM-L∞ maximum margin approach (Ozogur-Akyuz et al., In
Press). The SVM-L∞ selects the model with the maximum margin from the test sample in order to make predictions.
Once again, the training and testing sets were identical for both methods. Observe that despite the L∞ approach being
approximately 7s faster (on average) than our proposed method, we obtain an improvement of 0.0251 ± 0.0108 in error,
bringing us closer to the CV error rate (nonconformity is overall only 1.17% worse than CV when including
the Haberman data set and 0.44% worse when excluding it). In fact we obtain lower error rates than SVM-L∞ on all
data sets except for Credit (but with a smaller standard deviation).

Since we do not have a single number for the bound on generalisation (as with traditional bounds) but rather individual
values for each test sample, it is not possible to simply compare the bound with the test error. In order to show how
the bound performs we plot the generalisation error as a function of the bound value.

For each value of the bound we take the average error of all test points with predicted error less than or equal to
that value. In other words, we create a set⁴ $B$ containing the various bound values computed on the test samples.
Subsequently, for each element in the set, i.e., for all $b_i \in B$, we compute the average error value for the test samples that
have a bound value that is smaller than or equal to $b_i$.
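
The construction of this curve can be sketched as follows. The function name `error_vs_bound` and the per-point bound values and error indicators are hypothetical; they only illustrate the filtering described above.

```python
def error_vs_bound(bounds, errors):
    """For each distinct bound value b, average the errors of test points with bound <= b.

    bounds[i] is the bound value of test point i, errors[i] is 1 if it was
    misclassified and 0 otherwise. Returns a list of (b, average error) pairs.
    """
    curve = []
    for b in sorted(set(bounds)):  # the set removes repeated bound values
        selected = [e for b_i, e in zip(bounds, errors) if b_i <= b]
        curve.append((b, sum(selected) / len(selected)))
    return curve


# Hypothetical bound values and error indicators for five test points.
bounds = [3.06, 3.10, 3.10, 3.20, 3.30]
errors = [0, 0, 1, 1, 1]

curve = error_vs_bound(bounds, errors)
for b, err in curve:
    print(b, err)
```

The last point of the curve uses every test sample, so its error value equals the overall test error of the hypothetical data.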

Figure 2 shows a plot of this error rate as a function of the bound value. The final value of the function is the overall
generalisation error, while the lower error rates earlier in the curve are those attainable by filtering at different bound
values. As expected the error increases monotonically as a function of the bound value. Clearly there is considerable
weakness in the bound, but this is partly a result of our using a quite conservative VC bound; our main aim here is
to show that the predictions are correlated with the actual error rates.

We believe these results to be encouraging as our theoretically motivated model selection technique is faster and 
achieves similar error rates to Cross-Validation, which is generally considered to be the gold standard. We also find 
that the nonconformity strategy is slightly slower than the maximum margin approach but performs better in terms of 
generalisation error. 



³ The size of the validation set was varied without much difference in generalisation error.
⁴ Hence, no repetition of identical bound values is allowed.

Figure 2: The generalisation error as a function of the bound value for a single train-test split of the Bupa data set.
The final value of the function is the overall generalisation error.



6 Discussion 

We have presented a novel approach for model selection and test sample prediction using a nonconformity
(strangeness) measure. Furthermore, we have given a novel generalisation error bound on the loss of the learning
method. The proposed model selection approach is both simple and gives consistent generalisation performance
(Gold & Sollich, 2003).

We find these results encouraging as they constitute a much needed shift from costly model selection based approaches
to a faster method that is competitive in terms of generalisation error. Furthermore, in relation to the work of
Ozogur-Akyuz et al. (In Press) we have presented a method that 1) is not restricted to SVMs and 2) can use measures other
than the margin to make predictions. Therefore the nonconformity measure approach gives us a general way of
choosing to make predictions, allowing us the flexibility to apply it to algorithms that are not based on large margins.
In future work we aim to investigate the applicability of our proposed model selection technique to other learning
methods. Another future research direction is to apply different nonconformity measures to the SVM algorithm
presented in this paper such as, for example, a nearest neighbour nonconformity measure (Shafer & Vovk, 2008).